library(knitr)
library(here)
## here() starts at /Users/blake.feist/Documents/GitHub/VMS-pipeline

Master VMS and Fish Ticket Processing Workflow

This document collates all individual steps of VMS/Fish Ticket processing for a chosen year.


Choose Fishery

Indicate the PacFIN codes for the fish ticket species you are interested in tracking, as well as the gears. Landings and revenue for these species and gears will be tracked in the pipeline, for both targeted and non-targeted trips. Keep in mind that although the pipeline will record all species that are landed on each fish ticket, exact landings and revenues for these species will ONLY be reported if they are included in spp_codes. Likewise, gear_codes should be thought of NOT as a filter on which gears you get out of the pipeline, but rather as all of the gear types for which you would like total revenue and landings, regardless of target species.

spp_codes <- c("DCRB","CHNK") # All of the species for which you want specific revenue and landings by ticket
# spp_codes_name <- "DCRB_CHNK"
# here are your options:
#https://pacfin.psmfc.org/pacfin_pub/data_rpts_pub/code_lists/sp.txt
gear_codes <- c("SHELLFISH POT (CRAB)","CRAB OR LOBSTER POT","CRAB POT","CRAB RING","TROLL, (SALMON)","SET NET","OCEAN TROLL","TROLL (SALMON)","GILL NET (SALMON)","DRAG SEINE","DIP BAG NET","HOOK AND LINE","PURSE SEINE (SALMON)")
# All of the gears for which you want total revenue and landings by ticket
# As implemented, these are the full names (i.e., the "AGENCY DESCRIPTION" column of the table linked here, e.g. "CRAB POT", not "CPT")
# https://pacfin.psmfc.org/pacfin_pub/data_rpts_pub/code_lists/agency_gears.txt

Indicate the metric to use in determining targets of fishing trips, as well as the threshold for determining the target of each fishing trip.

# exvessel_revenue for now
pacfin_weight_metric <- "LANDED_WEIGHT_LBS" # another option is "LANDED_WEIGHT_MTONS"
pacfin_revenue_metric <- "EXVESSEL_REVENUE" # another option is "AFI_EXVESSEL_REVENUE"
# target_metric <- "revenue" # afi_revenue is an alternative option for newer fish tickets, but use caution as you need to know reference year

## how much "more important" does your target need to be than the species with the second greatest catch? Expressed as a ratio.

target_cutoff <- 1

## do you want to report revenue and lbs for the given target species? (dynamic)
# include_target <- TRUE

Choose Year to Process

This parameter will get passed through all processing steps

alltime <- proc.time()
process_year <- 2010

Lookback Window

We enforce a “lookback window” when matching the VMS and fish tickets such that if there are more than X days [an option that can be changed] between successive tickets for the same vessel, we only include data for those X days. Look for the lookback window option in the master process script.

Create lookback object. Here is where the lookback window duration can be changed. Default is 7 days.

lookback_window <- 7 # in days
year_begin <- lubridate::ymd(paste0(process_year,"-01-01"), tz= "America/Los_Angeles")
year_end <- lubridate::ymd(paste0(process_year,"-12-31"), tz= "America/Los_Angeles")

if(process_year == 2009){
  lookback_begin <- year_begin
} else{
  lookback_begin <- year_begin - lubridate::days(lookback_window)
}

Step 1: Process Raw Fish Ticket Data

Date written: 11/2018. Updated: 11/2019. By: M. Fisher, O.R. Liu

Purpose

This step pre-processes fish tickets from PacFIN for eventual matching to VMS records.

A few key aspects of the processing code:

  1. You can filter out fish tickets with less than a minimum exvessel revenue or landed pounds. This could help remove personal catch, or catch used for monitoring purposes (like domoic acid testing for Dungeness crab). This only removes fish tickets based on total exvessel revenue or total landed pounds, not species-specific minimums.

  2. When calculating species-specific landed pounds and exvessel revenue as separate columns, the code provides several options. It is possible to leave in all species listed on the fish ticket, which would create a different column for each species (around 300; not recommended). You can also select for species which you are particularly interested in. These species will each get their own column, and all other species will be grouped into “other”. If you want to break down the “other” category by gear type (i.e. other species caught by pot or trap gear, as is default here), then you can specify gear types in the gear types vector. Your choice of which species to highlight for this section will not impact the calculations of target species. In addition to reporting landed pounds and revenue for pre-determined species, you can also have a dynamic column that reports the same information for the target species of that trip.

  3. Target species is calculated using the maximum landed pounds and exvessel revenue across all possible species caught. A target species will only be identified if the ratio of the target’s landed pounds / exvessel revenue to the next greatest value meets the target_cutoff ratio set above (e.g., 10% greater when target_cutoff = 1.1). Otherwise, the target will be set to “none”. You will need to choose whether you want the target to be calculated using landed pounds vs. exvessel revenue.

  4. All gear types listed for a specific trip are combined into the columns gear_name_all and gear_code_all, separated by a “/”. This prevents retention of duplicate fish tickets, but allows for later filtering.

  5. If you want to reorder the columns in the data frame before writing it out, this can be done in the second to last code chunk, labeled (reorder_dataframe)

  6. The code filters out any fish tickets with missing vessel identifiers (VESSEL_NUM)
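The target-assignment rule in item 3 can be sketched on a toy ticket. This is a hypothetical illustration with made-up revenue values; the full implementation appears in the “Find Target Species” section of Step 1.

```r
# Toy illustration (hypothetical values) of the target-assignment rule:
# the leading species is the target only if its share of revenue exceeds
# the runner-up's share by at least the target_cutoff ratio.
library(dplyr)

target_cutoff <- 1.1  # e.g., first must be >= 10% greater than second

toy_ticket <- tibble(
  species = c("DCRB", "CHNK"),
  revenue = c(1000, 850)
)

toy_target <- toy_ticket %>%
  mutate(prop_rev = revenue / sum(revenue)) %>%
  summarise(TARGET_rev = ifelse(
    first(prop_rev, order_by = desc(prop_rev)) /
      nth(prop_rev, 2, order_by = desc(prop_rev)) >= target_cutoff,
    first(species, order_by = desc(prop_rev)),
    "NONE"))

toy_target$TARGET_rev  # "DCRB", since 1000/850 is about 1.18 >= 1.1
```

With target_cutoff = 1 (as set above), any strict leader is assigned as the target; raising the cutoff makes the assignment more conservative.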

Set Up Workspace

Required packages

library(tidyverse)
library(foreign)
library(qdapTools)
library(lubridate)
library(reshape2)
library(janitor)
library(here)
library(magrittr)

# ggplot theme
plot_theme <-   theme_minimal()+
  theme(text=element_text(family="sans",size=12,color="black"),
        legend.text = element_text(size=14),
        axis.title=element_text(family="sans",size=14,color="black"),
        axis.text=element_text(family="sans",size=8,color="black"),
        panel.grid.major = element_line(color="gray50",linetype=3))
theme_set(plot_theme)

keepers <- c('process_year','alltime','spp_codes','gear_codes','pacfin_weight_metric','pacfin_revenue_metric','target_cutoff','lookback_window','year_begin','year_end','lookback_begin')

rm(list=setdiff(ls(),keepers))


Choose directories and set objects

## which years of fish tickets are you interested in?
# process_year = 2021
# process_year = 2018 #Create a version of 2018 that has FTID

## which species-specific revenue and lbs columns would you like? (static)
# species_cols <- spp_codes

## calculate an "other" category according to a certain gear type? If so, which gear?
# gear_types <- gear_codes

## would you like target based on exvessel revenue or landed lbs?
# target_metric <- "revenue"

## how much "more important" does your target need to be than the species with the second greatest catch? Expressed as a ratio. (default: target species catch / landings must be 10% greater than second greatest species catch / landings = 1.1)
# target_cutoff <- 1.1

## do you want to report revenue and lbs for the given target species? (dynamic)
# include_target <- TRUE

Read in Data

This should contain raw PacFIN data; we only keep the columns that we need. For recent runs, all fish tickets are read from a single combined .rds file; the commented-out code below shows the older workflow that read per-year .csv files.

## for running 2020 and 2021 pipeline, all fish tickets are already together:
rawdat <- read_rds(here::here('Confidential','raw','fish tickets','all_fishtickets_1994_2023.rds'))
rawdat <- rawdat %>%
  filter(LANDING_YEAR %in%  process_year)

#earlier years:
#tixfiles <- list.files(here::here('data','raw','fish tickets'),full.names = T)
# tixfiles <- list.files(here::here('data','raw','fish tickets','has FTID'),full.names = T)
# 
# rawdat <- purrr::map_df(tixfiles,function(fl){
#   read_csv(fl,col_types= cols_only(
#     FISH_TICKET_ID = col_double(), 
#     FTID = col_character(), #Also include FTID column to compare between logbooks - won't work if column doesn't appear in ALL raw fish ticket files
#     PACFIN_PORT_CODE= col_character(), 
#     PACFIN_GROUP_PORT_CODE= col_character(), 
#     VESSEL_NUM= col_character(), 
#     AGENCY_CODE= col_character(), 
#     GEAR_CODE= col_double(),
#     GEAR_NAME= col_character(), 
#     PACFIN_GROUP_GEAR_CODE= col_character(), 
#     REMOVAL_TYPE_CODE= col_character(), 
#     REMOVAL_TYPE_NAME= col_character(), 
#     LANDING_DATE= col_character(),
#     LANDING_MONTH= col_double(), 
#     LANDING_YEAR= col_double(), 
#     PACFIN_SPECIES_CODE= col_character(), 
#     LANDED_WEIGHT_LBS= col_double(), 
#     EXVESSEL_REVENUE= col_double()))
# })
# 
# rawdat <- rawdat %>%
#   filter(LANDING_YEAR %in%  process_year)

Check to make sure columns were in the correct format for reading. This will return the names of columns with parsing errors.

problems(rawdat) %>% select(col) %>% distinct()
## # A tibble: 0 × 1
## # ℹ 1 variable: col <int>
# The columns with parsing errors are usually from columns that we do not need anyway

Edit existing columns

First, subset the raw data to include only the columns that are needed. Rename the columns that will be retained in the final processed data. The last three columns are used to calculate per-species / total revenue and landed weight, but will not ultimately be retained.

rawdat.sub <- rawdat %>% 
  dplyr::select(FISH_TICKET_ID, FTID, PACFIN_PORT_CODE, PACFIN_GROUP_PORT_CODE, VESSEL_NUM, AGENCY_CODE, GEAR_CODE, GEAR_NAME, PACFIN_GROUP_GEAR_CODE, REMOVAL_TYPE_CODE, REMOVAL_TYPE_NAME, LANDING_DATE, LANDING_MONTH, LANDING_YEAR, PACFIN_SPECIES_CODE, LANDED_WEIGHT_LBS, EXVESSEL_REVENUE) %>% 
# change some column names
  set_colnames(c("Rec_ID", "FTID", "pacfin_port_code", "port_group_code","drvid", "agency_code","gear_code", "gear_name", "gear_group", "removal_type_code", "removal_type_name", "date", "month", "year", 
                          "PACFIN_SPECIES_CODE", "LANDED_WEIGHT_LBS", "EXVESSEL_REVENUE"))

Remove the rows where the vessel identifier (drvid) is either “UNKNOWN” or blank ("")

rawdat.sub %<>%
  filter(drvid != "UNKNOWN") %>%
  filter(drvid != "")

Adjust gear group codes for some uncategorized gears

rawdat.sub %<>%
  mutate(gear_group=case_when(
    gear_group=='MSC' & gear_name %in% c('SPEAR','DIVING - ABALONE IRON','DIVING - RAKE/HOOKS SEA URCHINS','DIVING', 'SHELLFISH DIVER') ~ 'DVG',
    gear_group=='MSC' & gear_name %in% c('UNKNOWN','UNKNOWN OR UNSPECIFIED GEAR') ~ 'USP',
    gear_group=='MSC' & gear_name %in% c('AQUACULTURE FARM','OYSTER FARM','CLAM FARM') ~ 'FRM',
    TRUE ~ gear_group
  ))

Concatenate gear / catch information

Concatenate all species information for the fish ticket.

all.species <- rawdat.sub %>%
  group_by(Rec_ID, removal_type_name) %>%
  summarise(species_code_all = ifelse(length(unique(PACFIN_SPECIES_CODE)) > 1, paste(unique(PACFIN_SPECIES_CODE), collapse="/"), as.character(unique(PACFIN_SPECIES_CODE))))

Concatenate the gear information for the fish ticket

gear.info <- rawdat.sub %>%
  group_by(Rec_ID, removal_type_name) %>%
  summarise(gear_name_all = ifelse(length(unique(gear_name)) > 1, paste(unique(gear_name), collapse="/"), as.character(unique(gear_name))),
            gear_code_all = ifelse(length(unique(gear_name)) > 1, paste(unique(gear_code), collapse="/"), as.character(unique(gear_code))))

Find Target Species

We need to define the target species for each landed ticket. We will do this by finding the species with the greatest landed revenue for each trip.

Right now, each row of the data is a landing amount for a particular gear/ticket/species combo. We want to collapse these tickets in order to just have one row for each ticket, with an associated amount of landings for the TARGET species.

rawdat.sub %>% 
  count(Rec_ID) %>% 
  ggplot(aes(n,..count..))+
  geom_histogram(fill="seagreen",bins=30)+
  labs(x="Number of records per ticket",y="count")

Calculate landed pounds and revenue by species for each fish ticket. Then use these data to define proportions of pounds and revenue by species/ticket, in order to designate a “target” species for each fishing trip. We denote the target species for both revenue and pounds landed. If a ticket does not have a definitive target (i.e., the leading species’ proportion of landed pounds or revenue does not exceed the second-place species’ by the chosen target_cutoff ratio), denote “NONE” for the target.

pacfin_weight_metric <- sym(pacfin_weight_metric)
pacfin_revenue_metric <- sym(pacfin_revenue_metric)

rawdat.w.targets <- rawdat.sub %>% 
  distinct() %>% 
  # Group by ticket, removal type, and species
  group_by(Rec_ID,removal_type_code,removal_type_name,PACFIN_SPECIES_CODE) %>% 
  
  # calculate landed pounds and revenue by species
  summarise(spp_lbs=sum({{pacfin_weight_metric}}),
            spp_revenue=sum({{pacfin_revenue_metric}})) %>% 
  ungroup() %>% 
  
  # now, calculate total pounds per species across the entire ticket
  group_by(Rec_ID,PACFIN_SPECIES_CODE) %>% 
  mutate(tot_lbs_spp=sum(spp_lbs),
         tot_revenue_spp=sum(spp_revenue)) %>% 
  ungroup() %>% 
  
  # using these species totals, calculate proportions of total catch belonging to each species
  # by lbs landed and revenue
  group_by(Rec_ID) %>% 
  mutate(prop_lbs_spp=tot_lbs_spp/sum(tot_lbs_spp),
         prop_revenue_spp=tot_revenue_spp/sum(tot_revenue_spp)) %>% 
  
  # finally, assign a TARGET to the trip, defined as the species with the
  # LARGEST proportion of revenue for that trip
  # If the first species does not exceed the second by the target_cutoff ratio, target is NONE
  mutate(first_rev=dplyr::first(prop_revenue_spp,order_by = desc(prop_revenue_spp)),
         second_rev=dplyr::nth(prop_revenue_spp,n=2,order_by = desc(prop_revenue_spp)),
         first_rev_spp=dplyr::first(PACFIN_SPECIES_CODE,order_by= desc(prop_revenue_spp)),
         second_rev_spp=dplyr::nth(PACFIN_SPECIES_CODE,n=2,order_by= desc(prop_revenue_spp)),
         first_lbs=dplyr::first(prop_lbs_spp,order_by = desc(prop_lbs_spp)),
         second_lbs=dplyr::nth(prop_lbs_spp,2,order_by = desc(prop_lbs_spp)),
         first_lbs_spp=dplyr::first(PACFIN_SPECIES_CODE,order_by=desc(prop_lbs_spp)),
         second_lbs_spp=dplyr::nth(PACFIN_SPECIES_CODE,n=2,order_by= desc(prop_lbs_spp))) %>% 
  
  # check if first is >10% more than second, for revenue and landed lbs
  # or, if first and second species are the same (i.e. for a ticket with both commercial and personal use catch)
  # if so, assign that species as TARGET
  
  mutate(TARGET_rev=ifelse(is.na(first_rev/second_rev)|(first_rev/second_rev)>=target_cutoff|first_rev_spp==second_rev_spp,first_rev_spp,"NONE"),
         TARGET_lbs=ifelse(is.na(first_lbs/second_lbs)|(first_lbs/second_lbs)>=target_cutoff|first_lbs_spp==second_lbs_spp,first_lbs_spp,"NONE")) %>% 
  ungroup() %>% 
  select(-(first_rev:second_lbs_spp))

# Add back in dates, vessel IDs, etc.##Added FTID in this list
recID_attributes <- rawdat.sub %>% 
  select(Rec_ID,FTID,PACFIN_SPECIES_CODE,pacfin_port_code,port_group_code,gear_code,gear_name,gear_group,drvid,agency_code,date,month,year)
  
rawdat.w.targets %<>%
  left_join(recID_attributes,by=c("Rec_ID","PACFIN_SPECIES_CODE"))

# add all species and gear types for each ticket
rawdat.w.targets %<>%
  left_join(all.species) %>% 
  left_join(gear.info)

Calculate “Other” Category

For each fish ticket, sum up an “Other” category based on the choice made in the beginning of the script (gear_codes)

dat_targets_other <- rawdat.w.targets %>% 
  distinct() %>% 
  group_by(Rec_ID) %>% 
  # denote whether a record is in the Other category
  mutate(is_other=gear_name %in% gear_codes) %>% 
  
  # sum the other category for each ticket
  mutate(other_rev=sum(tot_revenue_spp[is_other]),
         other_lbs=sum(tot_lbs_spp[is_other]))

Add Species-Specific Columns

Here we add some species-specific columns based on the choice made in the beginning of the script (spp_codes).

# Use pivoting to define pounds and revenue for each species of interest
RecID_spp_cols <- dat_targets_other %>% 
  select(Rec_ID,PACFIN_SPECIES_CODE,tot_lbs_spp,tot_revenue_spp) %>% 
  filter(PACFIN_SPECIES_CODE %in% spp_codes) %>%
  distinct() %>% 
  pivot_longer(tot_lbs_spp:tot_revenue_spp) %>% 
  mutate(type=ifelse(str_detect(name,'lbs'),'lbs','revenue')) %>% 
  select(-name) %>% 
  group_by(Rec_ID) %>% 
  # spread to create the columns
  pivot_wider(names_from=c(PACFIN_SPECIES_CODE,type),values_from = value,values_fill = list(value=0))

# add back to main data
dat_targets_other %<>%
  left_join(RecID_spp_cols)

Add Trap/Pot/Ring designation

If the gear type is a trap or pot or ring, denote this

dat_targets_other %<>%
  mutate(TRAP.OR.POT.OR.RING=str_detect(gear_name,"TRAP")|str_detect(gear_name,"POT")|str_detect(gear_name,"RING"))

Format Dates

Change existing date columns to date objects and format.

dat_targets_other %<>%
  mutate(date=as.Date(date,"%d-%b-%y")) %>% 
  mutate(month=lubridate::month(date),
         month=month.name[month],
         year_mo=paste(year,lubridate::month(date),sep="_"),
         jdate=yday(date),
         year_jdate=paste(year,jdate,sep="_"),
         Week=week(date),
         year_Wk=ifelse(Week < 10, paste0(year,"_0",Week), paste0(year,"_",Week)))

Duplicated Dates

We add two columns that denote how many tickets were submitted by a vessel in a given day. n_tix denotes the total number of tickets, while n_nonpers_tix denotes the total number of tickets NOT for personal use

dat_out <- dat_targets_other %>% 
  
  # boolean indicating whether a record is for personal use
  mutate(is_personal=removal_type_name=="PERSONAL USE") %>% 
  
  # for each vessel and date
  group_by(drvid,date) %>% 
  
  # calculate quantities of interest
  mutate(n_tix=n_distinct(Rec_ID),
         n_pers=n_distinct(Rec_ID[is_personal]),
         n_nonpers_tix=n_tix-n_pers) %>% 
  select(-n_pers,-is_personal) %>% 
  ungroup()

Reorder Output

Clean up the output by reordering the columns of interest

#Added FTID to select list
dat_out %<>%
  select(-(spp_lbs:prop_revenue_spp),-is_other) %>% 
  select(Rec_ID,FTID,pacfin_port_code,port_group_code,drvid,agency_code,removal_type_code,removal_type_name,gear_code_all,TRAP.OR.POT.OR.RING,date,month,year,year_mo,jdate,year_jdate,Week,year_Wk,contains("_lbs"),contains("_rev"),species_code_all,other_lbs,other_rev) %>% 
  # the last distinct() removes records for which we no longer care about individual sub-trip species catches (we only care about catches of the focal species, plus total catch and TARGETs)
  distinct()

Save

Save the data!

#Save a separate version of output with FTID still included
yrs_of_data <- unique(dat_out$year)
for(y in seq_along(yrs_of_data)) {
 dat_out %>% 
    filter(year==yrs_of_data[y]) %>% 
    write_rds(here::here('Confidential','processed','fish tickets',paste0(yrs_of_data[y],"fishtix_withFTID",".rds",collapse=""))) 
}
# for(y in unique(rawdat$LANDING_YEAR)){
#   year.final.processdat <- filter(final.processdat, year == y )
#   write_csv(file = paste0(processdir, "fish tickets ", y, " processed_multispecies03.csv"), x = year.final.processdat, row.names = FALSE)
#   cat("wrote out file for ", y, "\n")
# }
x<-proc.time()-alltime

So far, this pipeline for 2010 VMS data has taken 5.74 minutes to run.

Step 2: Report Vessel Lengths

Purpose

Calculate vessel lengths based on vessel registration data, and then add in a few columns to the fish ticket data providing vessel length for each ticket.

We use a decision tree for calculating vessel length from PacFIN data. The template allows for one year or multiple years of fish tickets to be run at the same time.

Currently, the code writes out processed vessel length keys and combined vessel length/fish ticket landings data as .rds files. Code to write to .csv maintained but commented out.

Plotting of outputs is optional, indicated with the plot_output flag below.

Methods Overview:

  • Step 1: Pull the registration data up to two years prior to the fish ticket year (3 years total)

  • Step 2: Remove any registration entries with vessels larger than 200 feet and smaller than 10 feet

  • Step 3: Find the number of unique vessel lengths for all unique Vessel Number / Agency Code combinations.

  • Step 4: Calculate final vessel length for all unique Vessel Number / Agency Code combinations

  • If vessel length data was available for 2+ years and MAX_LENGTH < MIN_LENGTH + 10, then assume this reflects an actual increase or decrease in vessel size. Final vessel length is the mean of the two most recent vessel lengths.

  • If vessel length data was available for only one year, OR vessel length data was available for 2+ years but MAX_LENGTH > MIN_LENGTH + 10: pull registration data up to four years prior to the fish ticket year (5 years total). Then, if…

    1. One vessel length was recorded in 2+ years: use this value as vessel length.
    2. Two different vessel lengths were recorded:
       • MAX_LENGTH > 2 x MIN_LENGTH <– this is probably a decimal point error that we would need to check manually to determine true length. Denote NA.
       • MAX_LENGTH < 2 x MIN_LENGTH <– use the mode of these lengths.
    3. Three or more different vessel lengths were recorded: use the median of these lengths.
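The fallback rules in the decision tree above can be sketched as a small function. This is a rough illustration only, not the pipeline's implementation (which lives in the sourced Report_Vessel_Length_processedtix_functions.R); `lengths` is a hypothetical vector of lengths recorded for one vessel over the 5-year window.

```r
# Rough sketch of the 5-year fallback rules; the pipeline's actual logic
# is in Report_Vessel_Length_processedtix_functions.R.
# 'lengths' is a hypothetical vector of lengths recorded for one vessel.
fallback_length <- function(lengths) {
  u <- unique(lengths)
  if (length(u) == 1) return(u)                # one length recorded: use it
  if (length(u) == 2) {
    if (max(u) > 2 * min(u)) return(NA_real_)  # likely decimal-point error: NA
    tab <- table(lengths)                      # otherwise use the mode
    return(as.numeric(names(tab)[which.max(tab)]))
  }
  median(lengths)                              # three or more lengths: median
}

fallback_length(c(42, 42, 42))  # 42 (single recorded length)
fallback_length(c(30, 32, 30))  # 30 (mode of two lengths)
fallback_length(c(3, 35))       # NA (35 > 2 * 3, probable decimal error)
fallback_length(c(28, 30, 31))  # 30 (median of three lengths)
```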

Script Includes:

  1. Calculation of vessel lengths for X years of fish tickets.

  2. Summary of calculation output: proportion of fishing vessels missing lengths (per year), distribution of vessel lengths (per year and overall), vessel length against the number of years vessel length was recorded (per year), and the frequency of the calculation types used (per year and overall).

  3. Application of calculated vessel lengths to appropriate fish tickets.

Resources:

  • PMSC Fleet Report, esp. Table 8

  • PacFIN column names key


Setup and Options

Load packages

library(tidyverse)
library(here)
library(magrittr)
library(knitr)

# ggplot theme
plot_theme <-   theme_minimal()+
  theme(text=element_text(family="sans",size=12,color="black"),
        legend.text = element_text(size=14),
        axis.title=element_text(family="sans",size=14,color="black"),
        axis.text=element_text(family="sans",size=8,color="black"),
        panel.grid.major = element_line(color="gray50",linetype=3))
theme_set(plot_theme)

keepers <- c('process_year','alltime','spp_codes','gear_codes','pacfin_weight_metric','pacfin_revenue_metric','target_cutoff','lookback_window','year_begin','year_end','lookback_begin')

rm(list=setdiff(ls(),keepers))

options(dplyr.summarise.inform=F)

Source functions for vessel length calculations.

source(here::here('process steps',"Report_Vessel_Length_processedtix_functions.R"))

Enter processing options: the year to be processed (process_year, set in the master script above) and whether to plot outputs.

# process_year <- 2020
plot_output <- TRUE

Vessel Permit Data

Load permits data

permits <- read_csv(here('Confidential','raw','vessel info','2009_2023_vesselreg.csv'))

Subset Permit Data

Choose columns of interest and subset the data frame

pcols <- c("VESSEL_NUM", "AGENCY_CODE","REGISTRATION_YEAR","VESSEL_LENGTH", "LENGTH_TYPE_CODE")
pthin <- permits %>%  select(all_of(pcols))

Change all entries of 777 for vessel length to NA (based on recommendation from Blake Feist)

pthin %<>%
  mutate(VESSEL_LENGTH=na_if(VESSEL_LENGTH,777))

Remove any vessel lengths greater than 200 feet or less than 10 feet. Note: depending on the fishery, you may want to change these cutoffs.

pthin_length_filter <- pthin %>%
  filter(VESSEL_LENGTH < 200,VESSEL_LENGTH > 10)

Vessels of Interest

Read in raw fish ticket .rds files for each year of interest to data frame “landings”.

# import landings data
#Adjust import file name - use one that has FTID columns
landings <- purrr::map(process_year, function(y){
  read_rds(here::here('Confidential','processed','fish tickets',paste0(y,'fishtix_withFTID.rds')))
}) %>% 
  bind_rows()

# find relevant vessels
vessels <- landings %>%
  filter(drvid != "UNKNOWN", drvid != "MISSING", drvid != "", !is.na(drvid)) %>%
  select(drvid, agency_code, year) %>%
  distinct()

Calculate Vessel Lengths

Initiate empty data frame for all vessel length data across years

vessel_length_key_df <- tibble("drvid" = character(),
                               "agency_code" = character(),
                               "year" = numeric(),
                               "FINAL_LENGTH" = numeric(),
                               "TYPE_CALC" = character(),
                               "UNIQUE_LENGTHS" = numeric(),
                               "N_YEARS_LENGTH_RECORDED" = numeric(),
                               "HISTORIC_DATA" = character())

Do the calculation of vessel lengths.

for(y in unique(vessels$year)){
  # take only the vessels fishing in year "y"
  year_vessels <- filter(vessels, year == y)
  
  # identify the years of regulation data to pull
  target_reg_years <- seq(y-2, y)
  
  # subset for target registration years and summarise permits for each vessel
  cat("Calculating 3yr summary statistics for vessels in ", y, "\n")
  pthin_sumstats <- pthin_length_filter %>%
    filter(REGISTRATION_YEAR %in% target_reg_years,!is.na(VESSEL_LENGTH)) %>%
    group_by(VESSEL_NUM, AGENCY_CODE) %>%
    arrange(desc(REGISTRATION_YEAR)) %>%
    summarise(n_lengths = length(VESSEL_LENGTH),
              n_unique = length(unique(VESSEL_LENGTH)),
              max_length = max(VESSEL_LENGTH),
              min_length = min(VESSEL_LENGTH),
              mean2yr = get2yrmean(x=VESSEL_LENGTH, years=REGISTRATION_YEAR)) %>% 
    ungroup()
  
  # create empty vectors for this year
  final_vessel_lengths <- c()
  length_calc_vec <- c()
  n_unique_vec <- c()
  n_lengths_vec <- c()
  historic_vec <- c()
  processed <- 0
  
  cat("Calculating vessel lengths for fishing vessels in ", y, "...\n")
  # for each vessel fishing in this year #
  for(i in seq_along(year_vessels$drvid)){
    ## use the calc_length function (loaded from the "functions.R" file) to calculate vessel length
    tmp_vessel_length_info <- calc_length(permits=permits, vesseldat = year_vessels, lengthdat = pthin_length_filter, summarydat = pthin_sumstats, index = i)
    ## save the calc_length output to the appropriate position ("i") in the output vectors for this year
    n_lengths_vec[i] <- tmp_vessel_length_info[1] %>% as.numeric()
    n_unique_vec[i] <- tmp_vessel_length_info[2] %>% as.numeric()
    final_vessel_lengths[i] <- tmp_vessel_length_info[3] %>% as.numeric()
    length_calc_vec[i] <- tmp_vessel_length_info[4]
    ## if the vessel had to be calculated with historical data from over 5 years ago, a warning message will be saved in the calc_length output
    if(length(tmp_vessel_length_info) > 4){
      ### print the warning message
      print(tmp_vessel_length_info[5])
      ### save "Y" to the historic_vec for this year
      historic_vec[i] <- "Y"
    } else{ historic_vec[i] <- "N" }
    processed <- processed + 1
  }
  cat("Done processing", processed, "vessels for", y, "\n")
  # save all of the output vectors to a data frame for this year
  tmp_vessel_length_key_df <- tibble("drvid" = year_vessels$drvid,
                                     "agency_code" = year_vessels$agency_code,
                                     "year" = year_vessels$year,
                                     "FINAL_LENGTH" = final_vessel_lengths,
                                     "TYPE_CALC" = length_calc_vec,
                                     "UNIQUE_LENGTHS" = n_unique_vec,
                                     "N_YEARS_LENGTH_RECORDED" = n_lengths_vec,
                                     "HISTORIC_DATA" = historic_vec)
  ## bind this year's data frame to the end of the full data frame
  vessel_length_key_df <- bind_rows(vessel_length_key_df, tmp_vessel_length_key_df)
  cat("Wrote out", dim(tmp_vessel_length_key_df)[1], "lengths for", y, " to final data frame\n\n")
}
## Calculating 3yr summary statistics for vessels in  2010 
## Calculating vessel lengths for fishing vessels in  2010 ...
## [1] "ERROR: Vessel WRN32782 W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WRN02096 W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WRN00953 W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WRN00957 W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WRN00978 W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel OR399HX W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel AK8787AK W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN8614NM W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WRN94113 W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN149DH O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN9133LA O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel 559050 O had to be calculated with data > 4 years prior to 2010"
## [1] "WARNING: Length for vessel 559050 O had to be calculated with data > 4 years prior to 2010"
## [1] "WARNING: Length for vessel OR224KG O had to be calculated with data > 4 years prior to 2010"
## [1] "WARNING: Length for vessel OR224KG O had to be calculated with data > 4 years prior to 2010"
## [1] "ERROR: Vessel WN5382GC O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN7443LE O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN9033NV O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN5243LF O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN4728LH O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel OR399HX O had to be calculated with data > 4 years prior to 2010"
## [1] "WARNING: Length for vessel OR399HX O had to be calculated with data > 4 years prior to 2010"
## [1] "ERROR: Vessel 952234 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN0691RU O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel 933626 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel 593326 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel 591876 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN0447RU O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel OR243FM O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN9311NY O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN9430NL O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN8260LB O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN6083CG O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel WN2963LA O had to be calculated with data > 4 years prior to 2010"
## [1] "WARNING: Length for vessel WN2963LA O had to be calculated with data > 4 years prior to 2010"
## [1] "ERROR: Vessel 578242 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel 608950 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN4665S O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel 588874 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel 585860 O had to be calculated with data > 4 years prior to 2010"
## [1] "WARNING: Length for vessel 585860 O had to be calculated with data > 4 years prior to 2010"
## [1] "ERROR: Vessel 592880 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN7528JE O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel 552029 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel 556956 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel 932422 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel 687238 O had to be calculated with data > 4 years prior to 2010"
## [1] "WARNING: Length for vessel 687238 O had to be calculated with data > 4 years prior to 2010"
## [1] "ERROR: Vessel 602395 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN6567SB O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN15DH O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel 223621 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel 921572 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel 685542 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel WN9650LH O had to be calculated with data > 4 years prior to 2010"
## [1] "WARNING: Length for vessel WN9650LH O had to be calculated with data > 4 years prior to 2010"
## [1] "ERROR: Vessel OR773ADR O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel 696433 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel CF5533SC C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel CF9293KB C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel CF0735GS C had to be calculated with data > 4 years prior to 2010"
## [1] "WARNING: Length for vessel CF0735GS C had to be calculated with data > 4 years prior to 2010"
## Done processing 3870 vessels for 2010 
## Wrote out 3870 lengths for 2010  to final data frame

Explore Vessel Lengths

How many fishing vessels WITH Vessel Numbers are missing calculated lengths?

cat(sum(is.na(vessel_length_key_df$FINAL_LENGTH)), 'vessels with Vessel ID numbers are missing calculated lengths, or', sum(is.na(vessel_length_key_df$FINAL_LENGTH)) / length(vessel_length_key_df$FINAL_LENGTH)*100, 'percent.')
## 46 vessels with Vessel ID numbers are missing calculated lengths, or 1.18863 percent.
for(y in unique(vessel_length_key_df$year)){
  for(a in unique(vessel_length_key_df$agency_code)){
    tmp_dat <- vessel_length_key_df %>%
      filter(agency_code == a) %>%
      filter(year == y)
    missing <- sum(is.na(tmp_dat$FINAL_LENGTH))
    cat("Number", y, "Vessels Missing Vessel Lengths for", a, ":", missing, "\n")
    cat("Proportion:", missing/length(tmp_dat$FINAL_LENGTH), "\n\n")
  }
}
## Number 2010 Vessels Missing Vessel Lengths for W : 9 
## Proportion: 0.008078995 
## 
## Number 2010 Vessels Missing Vessel Lengths for O : 35 
## Proportion: 0.02966102 
## 
## Number 2010 Vessels Missing Vessel Lengths for C : 2 
## Proportion: 0.001269036
# ggplot(data=filter(vessel_length_key_df, !is.na(FINAL_LENGTH)), aes(x=as.numeric(FINAL_LENGTH),y=as.numeric(N_YEARS_LENGTH_RECORDED))) +
#   geom_point()+
#   facet_wrap(~year) +
#   xlab("Calculated Vessel Length") +
#   ylab("Number of Years Length Was Recorded") +
#   theme(axis.text.x = element_blank())
# Frequency table of number of years' records
vessel_length_key_df %>% 
  count(N_YEARS_LENGTH_RECORDED) %>% 
  kable(col.names = c("Number of Records","Number of Vessels"))
| Number of Records | Number of Vessels |
|------------------:|------------------:|
|                 1 |               415 |
|                 2 |              3403 |
|                 3 |                 4 |
|                 8 |                 1 |
|                13 |                 1 |
|                NA |                46 |

If the plotting option above is TRUE, make some informative plots.

if(plot_output){
  vessel_length_key_df %>% 
    filter(!is.na(FINAL_LENGTH)) %>% 
    ggplot(aes(as.numeric(FINAL_LENGTH))) +
      geom_histogram(bins=10) +
      facet_wrap(~year) +
      labs(x="Vessel Length",y="Number of Vessels",title="Vessel Length Frequency Distribution")
}

if(plot_output){
  vessel_length_key_df %>% 
    ggplot(aes(TYPE_CALC)) +
    geom_bar() +
    facet_wrap(~year) +
    labs(x="Calculation Type",y="Number of Vessels")+
    theme(axis.text.x = element_text(angle=90, hjust=1))
}

Assess Missing Data for Fish Tickets

For each year, join vessel length data with landings. How many fish tickets are missing a vessel length, by agency? (These counts cover all tickets, not just the target fishery.)

for(y in unique(vessel_length_key_df$year)){
  landings_subset <- landings %>%
    filter(year == y)
  dcrb_landings_length <- left_join(landings_subset, filter(vessel_length_key_df, year == y), by=c("drvid", "agency_code"))
  for(a in unique(dcrb_landings_length$agency_code)){
    tmp_dat <- filter(dcrb_landings_length, agency_code == a)
    missing <- sum(is.na(tmp_dat$FINAL_LENGTH))
    cat("Number of", y, "fish tickets missing vessel lengths for", a, ":", missing, "\n")
    cat("Proportion:", missing/length(tmp_dat$FINAL_LENGTH), "\n\n")
  }
}
## Number of 2010 fish tickets missing vessel lengths for W : 536 
## Proportion: 0.02156681 
## 
## Number of 2010 fish tickets missing vessel lengths for O : 320 
## Proportion: 0.01298912 
## 
## Number of 2010 fish tickets missing vessel lengths for C : 160 
## Proportion: 0.002639959

Save Results

Vessel lengths key

for(y in unique(vessel_length_key_df$year)){
  key_subset <- vessel_length_key_df %>%
    filter(year == y)
  write_rds(key_subset,here::here('Confidential','processed','vessel length keys',paste0('vessel_length_key_', y, '.rds')))
}

Landings matched to vessel lengths

#save output with diff name to indicate FTID still in data
for(y in unique(vessel_length_key_df$year)){
  landings_subset <- landings %>%
    filter(year == y)
  vessel_length_key_df_subset <- vessel_length_key_df %>%
    filter(year==y) %>%
    select(-year)
  landings_length <- left_join(landings_subset,vessel_length_key_df_subset, by=c("drvid", "agency_code"))
  write_rds(landings_length,here::here('Confidential','processed','fish tickets',paste0(y, 'fishtix_vlengths_withFTID.rds')))
}
x<-proc.time()-alltime

So far, this pipeline for 2010 VMS data has taken 6.26 minutes to run.

Step 3: Process Raw VMS Data

Purpose

This step in data processing further cleans raw VMS data.

Steps:

The following changes are made for further processing:

  1. Add in Cartesian geocoordinates (after checking for missing lat/lon)

  2. Add in unique record numbers for each VMS record

  3. Remove VMS records outside of a given latitude (assumed to be non-West Coast)

  4. Add in the bathymetry layer

  5. Remove high elevation points

  6. Reorder columns

Setup and Options

To process the data, the user can choose the year and lat/lon bounds of VMS data points (currently set conservatively around US West Coast waters). Additionally, choose an upper bound for the bathymetry layer, conservatively set at +100 m, which should certainly retain all ocean areas.

# process_year = 2021
lon_upper = -117 # if you do not want this filter, set to NA
lon_lower = -132 # if you do not want this filter, set to NA
lat_lower = 32 # if you do not want this filter, set to NA
lat_upper = 50 # if you do not want this filter, set to NA
bathy_upper = 100

Read in Data

# Updated 2/21/2024 with new VMS files
vms_fn <- list.files(here('Confidential','raw','vms'),full.names = T)
vms_we_want <- vms_fn[grepl("chunk",vms_fn)&grepl(process_year,vms_fn)]

vms_raw <- purrr::map_df(vms_we_want,read_rds) %>% 
  filter(year==process_year)

Check for Missing Data

Are any records missing lat/lon?

missing_lat <- sum(is.na(vms_raw$LAT)); missing_lat
## [1] 0
# if(missing_lat > 0){
#   View(vms_raw %>% filter(is.na(LAT)))
# }

AFTER CHECKING THE ORIGINAL OLE FILE, delete the missing record(s).

vms_raw %<>% filter(!is.na(LAT))

Some of the latitude measurements are listed as negatives (likely due to the parsing process). Check how many observations this affects, and then change them to positive.

cat(sum(vms_raw$LAT<0),'records have negative latitude.')
## 4125 records have negative latitude.
vms_raw %<>% mutate(LAT = abs(LAT))

Add Cartesian Coordinates and Assign VMS Record Numbers

Convert lat/long to cartesian geocoordinates for UTM zone 10.

vms_coords <- vms_raw %>% 
  # convert to simple features point object
  st_as_sf(coords=c("LON","LAT"),crs="+proj=longlat +datum=WGS84") %>% 
  # project to UTM zone 10
  st_transform(crs = "+proj=utm +north +zone=10 +ellps=WGS84") %>% 
  # extract XY coordinates
  st_coordinates()

# add to data frame
vms_raw %<>%
  mutate(X_COORD = vms_coords[,1],
         Y_COORD = vms_coords[,2])

# validate with back-conversion to lat/long for a random selection of 10 records
test_coords <- vms_raw %>%
  sample_n(10) %>%
  dplyr::select(LON, LAT, X_COORD,Y_COORD)

test_coordsLL <- test_coords %>% 
  st_as_sf(coords=c("X_COORD","Y_COORD"),crs="+proj=utm +north +zone=10 +ellps=WGS84") %>% 
  st_transform(crs="+proj=longlat +datum=WGS84") %>% 
  st_coordinates()

# test whether coordinates are the same to 4 decimal places
all(round(test_coords[,c(1,2)],digits = 4)==round(test_coordsLL,digits=4))
## [1] TRUE

Add unique ID numbers for VMS records.

VMS_recno: assign each record a unique integer ID (its row number offset by process_year × 10^7), then shuffle the IDs with a fixed random seed. CHECK PREVIOUS YEARS OF DATA TO MAKE SURE THERE ARE NO DUPLICATE RECORD NUMBERS

set.seed(0401) # set a RNG seed so that we have reproducibility if/when code is re-run
vms <- vms_raw %>% 
  ungroup() %>% 
  mutate(VMS_RECNO=row_number()+process_year*1e7) %>% 
  mutate(VMS_RECNO=sample(VMS_RECNO))

# Did we make any duplicates for this year (by accident)?
!length(unique(vms$VMS_RECNO))==nrow(vms)
## [1] FALSE
# Nope, all VMS_RECNO unique
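The cross-year duplicate check called for above is not implemented in this chunk; because VMS_RECNO is offset by process_year * 1e7, collisions across years should be impossible by construction. A minimal sketch of an explicit check (assuming prior years' cleaned files follow the `_vms_clean.rds` naming convention used later in this pipeline) might look like:

```r
# Hypothetical check: compare this year's VMS_RECNO values against all
# previously processed years (file paths are an assumption based on the
# naming convention used elsewhere in this document)
prev_recnos <- purrr::map(2009:(process_year - 1), function(y){
  fn <- here::here('Confidential','processed','vms',paste0(y,'_vms_clean.rds'))
  if(file.exists(fn)) read_rds(fn)$VMS_RECNO else integer(0)
}) %>% unlist()

# should be TRUE: no overlap with any prior year
!any(vms$VMS_RECNO %in% prev_recnos)
```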

Remove Out-of-Bounds Records

This will delete records above the US-Canadian border (lat_upper) and below the US-Mexican border (lat_lower).

dim(vms)
## [1] 6634436      18
if(!is.na(lat_upper)){
  vms <- filter(vms, LAT < lat_upper)
} 
if(!is.na(lat_lower)){
  vms <- filter(vms, LAT > lat_lower)
}
dim(vms)
## [1] 5936948      18

This will delete records far out to sea, or inland.

dim(vms)
## [1] 5936948      18
if(!is.na(lon_upper)){
  vms <- filter(vms, LON < lon_upper)
} 
if(!is.na(lon_lower)){
  vms <- filter(vms, LON > lon_lower)
}
dim(vms)
## [1] 5908841      18

Remove Duplicate Records

Duplicates are any VMS records with the same: UTC_TIME, LAT, LON, VESSEL_NAME, DOCNUM.

Create data frame where duplicated record (second record) is removed.

dim(vms)
## [1] 5908841      18
tm <- proc.time()
vms_nodup <- vms %>% 
  distinct(UTC_TIME,LAT,LON,VESSEL_NAME,DOCNUM,.keep_all = TRUE)
proc.time()-tm
##    user  system elapsed 
##   1.470   0.265   1.739
cat("Proportion of VMS records removed for being true duplicate records:", 1-nrow(vms_nodup)/nrow(vms))
## Proportion of VMS records removed for being true duplicate records: 0.002603556

Save the duplicate entries to a file, to understand what data is being removed!

# janitor::get_dupes()
tm <- proc.time()
vms_dupes <- vms %>% 
  get_dupes(UTC_TIME,LAT,LON,VESSEL_NAME,DOCNUM) %>% 
  arrange(DOCNUM, UTC_TIME) %>% 
  dplyr::select(dupe_count,everything())
proc.time()-tm
##    user  system elapsed 
##  39.774   0.801  40.607
write_rds(vms_dupes, here::here('Confidential','processed','vms',paste0(process_year,'_duplicates_only.rds')))

Add Bathymetry

Read in the bathymetry object: Blake Feist’s 3 arc-second composite bathymetry

bathy.grid <- rast(here::here('data','bathymetry','composite_bath.tif'))

Get bathymetry at VMS data points

vms_sp <- vms_nodup %>% st_as_sf(coords=c("LON","LAT"),crs=4326)

vmsll <- st_coordinates(vms_sp)

bathy.points <- terra::extract(bathy.grid, vmsll) / 10 # convert from decimeters to meters

vms_nodup <- mutate(vms_nodup, NGDC_M = bathy.points[,1])

Remove high elevation bathymetry.

vms_nodup_bathy <- vms_nodup %>% filter(NGDC_M < bathy_upper)

cat('Filtering out records greater than',bathy_upper,'meters in elevation resulted in the removal of',nrow(vms_nodup)-nrow(vms_nodup_bathy),'records (',(nrow(vms_nodup)-nrow(vms_nodup_bathy))/nrow(vms_nodup)*100,'percent ).')
## Filtering out records greater than 100 meters in elevation resulted in the removal of 28192 records ( 0.478361 percent ).

View Output

Plot points. We plot a subset of 100 thousand VMS records to make plotting time reasonable.

vms_sf <- vms_nodup_bathy %>% st_as_sf(coords=c("X_COORD","Y_COORD"),crs = "+proj=utm +north +zone=10 +ellps=WGS84")

# Plotting takes a long time
# Try with a subset of 100k points
vms_sf_sample <- vms_sf %>% 
  sample_n(100000) %>% 
  filter(NGDC_M>-100000)

# coastline
coaststates <- ne_states(country='United States of America',returnclass = 'sf') %>% 
  filter(name %in% c('California','Oregon','Washington')) %>% 
  st_transform(st_crs(vms_sf))

ggplot()+
  geom_sf(data=coaststates,fill='gray50',col=NA)+
  geom_sf(data=vms_sf_sample,size=0.5,col='blue')+
  labs(x='Longitude',y='Latitude',title=paste0(process_year," VMS records"))

Organize Output

vms_ordered <- vms_nodup_bathy %>% 
  dplyr::select(UTCDATETIME,LAT,LON,NGDC_M,VESSEL_NAME,AVG_SPEED,AVG_COURSE,DOCNUM,DECLARATION_CODE,X_COORD,Y_COORD,VMS_RECNO)

Save Results

write_rds(vms_ordered,here::here('Confidential','processed','vms',paste0(process_year,'_vms_clean.rds')))
x<-proc.time()-alltime

So far, this pipeline for 2010 VMS data has taken 8.56 minutes to run.

Step 4: Match VMS and Fish Ticket Data

Purpose

This portion of the data processing matches the cleaned PacFIN fish tickets to the cleaned VMS data. We match the two datasets together using unique vessel identification numbers that are common to both datasets. We define a fishing “trip” as the collection of VMS observations directly preceding a landed fish ticket. Hence, all of the VMS pings that occur between two tickets are considered part of the latter ticket’s trip. We enforce a “lookback window” such that if there are more than X days [an option that can be changed] between successive tickets for the same vessel, we only include data for those X days. Look for the lookback window option in the master process script.

Later, the data are filtered and cleaned for depth, unrealistic speeds between pings, and corrected for multiple-ticket days.

Setup and Data Import

Choose the year to process (this is usually done in the top-level script).

# process_year <- 2021

Load fish ticket data (processed), including the last month of the previous year (for handling trips spanning January 1st of the current year)

#Load in data with FTID and vessel length
fishtix <- read_rds(here::here('Confidential','processed','fish tickets',paste0(process_year,'fishtix_vlengths_withFTID.rds')))

#Note that some of the previous years don't yet have length and FTID columns, as that step hasn't been run for them
if(process_year>2009){
  fishtix_prev <- read_rds(here::here('Confidential','processed','fish tickets',paste0(process_year-1,'fishtix_withFTID.rds'))) %>% 
    filter(month(date)==max(month(date)))
  fishtix <- bind_rows(fishtix_prev,fishtix)
  rm(fishtix_prev)
}

Load VMS data, including the last month of the previous year (for handling trips spanning January 1st of the current year)

vms <- read_rds(here::here('Confidential','processed','vms',paste0(process_year,'_vms_clean.rds')))

if(process_year>2009){
  vms_prev <- read_rds(here::here('Confidential','processed','vms',paste0(process_year-1,'_vms_clean.rds'))) %>% 
    filter(month(UTCDATETIME)==max(month(UTCDATETIME)))
  vms <- bind_rows(vms_prev,vms)
  rm(vms_prev)
}

Prep VMS Data for Matching

Add West Coast Dates

PacFIN reports in west coast time, so create two columns with date/time in Pacific time zone

vms %<>%
  # with time (hour, minute)
  mutate(westcoastdate = with_tz(UTCDATETIME, tzone = "America/Los_Angeles")) %>%
  # without time
  mutate(westcoastdate_notime = as_date(westcoastdate))

Prep Fish Ticket Data for Matching

Truncate fish ticket data to match VMS year, with the lookback window where appropriate. This should prevent fish tickets from the previous year being assigned the same VMS data points.

fishtix %<>%
  filter(date >= lookback_begin & date <= year_end) %>% 
  
  #remove unneeded date columns
  #Edited to keep FTID and final_length columns
  select(Rec_ID:date,TARGET_lbs:FINAL_LENGTH)

Add begin and end dates for each ticket to use in VMS matching. These define the windows of time in which we “search” for matching VMS pings. We also add a column for trip duration in days, for use in QA later.

ptm <- proc.time()
fishtix %<>%

  # group by vessel
  group_by(drvid) %>%
  
  #add month
  mutate(month=month(date)) %>% 
  
  # add a time for the end of day
  mutate(end = ymd_hms(paste(date, "23:59:59"),tz= "America/Los_Angeles")) %>%
  # arrange(date) %>%

  # find the ticket BEFORE each ticket and note its time. Instead of creating NAs for the first ticket of the year for each vessel, assign a default value of the first ticket minus the lookback window.
  mutate(prev_ticket_dttm=lag(end,order_by=end,default=first(end,order_by = end)-days(lookback_window))) %>%

  # add lookback window
  mutate(lookback_dttm=end-days(lookback_window)) %>%

  # choose "begin" date for each ticket as the LATEST in time between previous ticket date and lookback date
  mutate(begin=ifelse(lookback_dttm>prev_ticket_dttm,lookback_dttm,prev_ticket_dttm) %>% as_datetime(tz= "America/Los_Angeles")) %>%

  # add trip duration
  mutate(trip_dur=interval(begin,end)) %>% mutate(trip_dur=time_length(trip_dur,"days")) %>% 
  # select(-prev_ticket_dttm,-lookback_dttm) %>% 

  ungroup()
x<-proc.time()-ptm
cat('Defining lookback window for all data took',x[3]/60,'minutes to run.')
## Defining lookback window for all data took 0.6299333 minutes to run.
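As a sanity check on the begin-date logic, here is a toy example with hypothetical dates and a 7-day lookback window. It uses pmax() in place of the ifelse()/as_datetime() construction above; the two are equivalent, but pmax() preserves the datetime class directly.

```r
library(dplyr)
library(lubridate)

toy <- tibble(
  drvid = "V1",
  end = ymd_hms(c("2010-03-01 23:59:59",
                  "2010-03-04 23:59:59",
                  "2010-03-20 23:59:59"),
                tz = "America/Los_Angeles")
) %>%
  group_by(drvid) %>%
  mutate(prev_ticket_dttm = lag(end, default = first(end) - days(7)),
         lookback_dttm = end - days(7),
         begin = pmax(lookback_dttm, prev_ticket_dttm)) %>%
  ungroup()

# Ticket 2 begins at ticket 1's end (the previous ticket falls inside the
# 7-day window); ticket 3 begins a full 7 days back because the previous
# ticket is older than the lookback window.
```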

Match VMS to Fish Tickets

Here is where we do the actual matching of VMS data to fish tickets. Matching is done by vessel ID number and time, which are recorded in both datasets. We first join all VMS data for each vessel, then filter such that only the records within the appropriate window of time for each ticket (defined in the previous step) are retained.

ptm <- proc.time()

join_vms_fishtix <- function(mth){
  
  tmptix <- fishtix %>% 
    filter(month==mth)
  
  tmpvms <- vms %>% filter(month(westcoastdate)%in%c(mth-1,mth))
  
  tmptix %>%
    # Left-join the VMS data by Vessel ID (drvid/DOCNUM)
    left_join(tmpvms,by=c('drvid'='DOCNUM')) %>% 
    
    # filter the VMS data such that the data for each trip is correctly truncated to fall within the lookback window. 
    # If any tickets did NOT match VMS data, they will not have associated dates from VMS.
    # We retain these, and then produce a boolean indicator denoting that they did not match.
    
    filter((westcoastdate >= begin & westcoastdate <= end)| is.na(westcoastdate)) %>%
    mutate(has_vms=ifelse(is.na(westcoastdate),0,1)) %>% 
    ungroup()
}

fishtix_vms <- purrr::map_df(unique(fishtix$month),~join_vms_fishtix(.)) %>% 
  # finally, filter the edge case such that we remove trips ENDING before January 1st.
  # this will retain the VMS records that are before Jan 1 for any trips ending on or after Jan 1
    filter(year(date)==process_year)


## Join vms to fishtix
# fishtix_vms <- fishtix %>%
#   
#   # Left-join the VMS data by Vessel ID (drvid/DOCNUM)
#   left_join(vms,by=c('drvid'='DOCNUM')) %>% 
#   
#   # filter the VMS data such that the data for each trip is correctly truncated to fall within the lookback window. 
#   # If any tickets did NOT match VMS data, they will not have associated dates from VMS.
#   # We retain these, and then produce a boolean indicator denoting that they did not match.
#   
#   filter((westcoastdate >= begin & westcoastdate <= end)| is.na(westcoastdate)) %>%
#   mutate(has_vms=ifelse(is.na(westcoastdate),0,1)) %>% 
#   ungroup()
# 
x<-proc.time()-ptm
cat('Joining VMS data and filtering took',x[3]/60,'minutes to run for',process_year)
## Joining VMS data and filtering took 0.5054667 minutes to run for 2010

Visualize Matching Results

We now have the full matched dataset, which retains 79871 fish ticket records with no associated VMS data.
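The count quoted above can be recovered directly from the joined data; a sketch (assuming each unmatched ticket contributes exactly one row, as in the join above):

```r
# Sketch: fish ticket records retained with no associated VMS data
fishtix_vms %>% filter(has_vms == 0) %>% nrow()
```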

# test<- fishtix_vms %>% 
#   select(Rec_ID,trip_dur,has_vms) %>%
#   mutate(trip_dur=as.integer(trip_dur)) %>% 
#   distinct() %>% 
#   ungroup() %>% 
#   mutate(has_vms=ifelse(has_vms,"VMS Trips","Non-VMS Trips"))

# How long are trips?
fishtix_vms %>% 
  select(Rec_ID,trip_dur,has_vms) %>%
  mutate(trip_dur=as.integer(trip_dur)) %>% 
  distinct() %>% 
  ungroup() %>% 
  mutate(has_vms=ifelse(has_vms,"VMS Trips","Non-VMS Trips")) %>% 
  ggplot(aes(trip_dur,fill=has_vms))+
  geom_bar(position = 'dodge')+
  scale_x_discrete(limits=seq(0,7,by=1))+
  labs(x="Trip Duration (Days)",y="Number of Trips",title="Trip Duration",fill="")

# Count how many trips have multiple species (measured as multiple targets)
fishtix_vms %<>% mutate(multispecies=ifelse(nchar(species_code_all)>4,1,0))

fishtix_vms %>% 
  select(Rec_ID,multispecies,has_vms) %>%
  distinct() %>% 
  mutate(has_vms=ifelse(has_vms,"VMS Trips","Non-VMS Trips")) %>% 
  ggplot(aes(factor(multispecies),fill=factor(multispecies)))+
  scale_fill_discrete(labels=c('No',"Yes"))+
  geom_bar()+
  facet_wrap(~has_vms)+
  labs(x="",y="Number of Fish Tickets",title="Number of Multispecies Trips",fill="Multispecies\nTrip")+
  theme(axis.text.x = element_blank())

# Distribution of number of trips per vessel, VMS vs. non-VMS records
fishtix_vms %>% 
  group_by(drvid,has_vms) %>% 
  summarise(ntix=length(unique(Rec_ID))) %>% 
  ungroup() %>% 
  mutate(has_vms=ifelse(has_vms,"VMS Vessels","Non-VMS Vessels")) %>% 
  ggplot(aes(ntix,after_stat(count)))+
  geom_histogram(bins=10)+
  facet_wrap(~has_vms)+
  labs(x=paste("Number of Trips in",process_year),y="Number of Vessels",title="Distribution of Number of Recorded Trips per Vessel")

# adjust for inflation based on reference year using Fred deflators (added 8May2024)
# left_join(gdp_defl, by = YEAR) %>% #left_join(gdp_defl, by = c("year in fishtix_vms name" = "YEAR")
#   mutate(AFI_REVENUE = EXVESSEL_REVENUE / DEFL)

Save Results

Save a version that has all fish ticket data, including records with no VMS match, as well as a filtered dataset that excludes non-VMS records.

fishtix_vms %<>% select(-month)
fishtix_vms_all <- fishtix_vms
fishtix_vms_only <- fishtix_vms %>% filter(has_vms==1)
#Save versions that have FTID and length
write_rds(fishtix_vms_all,here::here('Confidential','processed','matched','matching',paste0(process_year,'_matched_alltix_withFTID.rds')))
write_rds(fishtix_vms_only,here::here('Confidential','processed','matched','matching',paste0(process_year,'_matched_vmstix_only_withFTID.rds')))
x<-proc.time()-alltime
# cat('This step for',process_year,'VMS data took',round(x[3]/60,2),'minutes to run.')

So far, this pipeline for 2010 VMS data has taken 10.27 minutes to run.

Step 5: Filter Matched Data by Speed and Proximity to Ports

Purpose

This final step cleans up the matched fish ticket/VMS records spatially through filtering. The script:

  • Removes outlier trips where the Port of Landing is > 50km from the last VMS point of the associated trip

  • Recalculates average speed for each trip and filters out segments with unrealistically large speed. This speed cutoff is set by default at 20 m/s, but can be altered in this script.

  • Flags VMS points that have a bathymetry value > 0

  • Flags VMS points that are within a 3-kilometer buffer of a port, so we can remove points that are not directly associated with a fishing trip (e.g., because the boat is sitting in port between trips)

Setup and Data Import

Read in the data

First, the full vms data set for 2010.

vms <- read_rds(here::here('Confidential','processed','matched','matching',paste0(process_year,'_matched_vmstix_only_withFTID.rds')))

Remove trips where Port of Landing > 50km from last VMS data point

Latitude/longitude of ports. Coordinates were provided by Blake Feist.

portlist_coords <- read_csv(here::here('data','port_coords_fromBlake_edited.csv'),col_types='cddd') %>% 
  select(port_code,Lon,Lat) %>% 
  set_names(c('port_code','portlon','portlat'))

Filter VMS data to include only the last VMS data point for each trip

fishtix_lastVMS <- vms %>%
  group_by(Rec_ID) %>%
  top_n(1, wt=westcoastdate) %>%
  ungroup()

Add port locations to last VMS points

fishtix_lastVMS <- left_join(fishtix_lastVMS, portlist_coords, by=c("pacfin_port_code" = "port_code"))

How many port lat/lon coordinates are NA? For 2010, 3.15 percent of trips are missing coordinates. We remove these observations.

Note: why is this happening?
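The missing-coordinate share quoted above can be computed before filtering; a sketch:

```r
# Sketch: share of trips whose Port of Landing failed to match a
# coordinate in the port list
fishtix_lastVMS %>%
  summarise(pct_missing = mean(is.na(portlat) | is.na(portlon)) * 100)
```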

fishtix_lastVMS %<>% filter(!is.na(portlat),!is.na(portlon))

Find distance between end VMS and port

Calculate distance with geosphere

port_dists <- geosphere::distHaversine(p1=cbind(fishtix_lastVMS$portlon, fishtix_lastVMS$portlat),
                                    p2=cbind(fishtix_lastVMS$LON, fishtix_lastVMS$LAT))

Add the distances as another column

fishtix_lastVMS %<>%
  mutate(port_to_VMS = port_dists / 1000)

Sort Trips to Keep or Remove

trips_keep_remove_portdist <- fishtix_lastVMS %>% 
  mutate(keep_remove_portdist= ifelse(port_to_VMS <= 50, 'keep','remove'))
trips_to_keep <-trips_keep_remove_portdist %>% 
  filter(keep_remove_portdist=='keep')
trips_to_remove <- trips_keep_remove_portdist %>% 
  filter(keep_remove_portdist=='remove')

For 2010, if we filter using this criterion, we retain 22288 trips, and remove 735 trips, which is about 3.19 percent of all trips.
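The keep/remove counts quoted above follow directly from the sorted data frames; a sketch:

```r
# Sketch: counts and removal percentage for the port-distance criterion
c(kept    = nrow(trips_to_keep),
  removed = nrow(trips_to_remove),
  pct_removed = nrow(trips_to_remove) / nrow(trips_keep_remove_portdist) * 100)
```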

retained_trips_plot <- trips_to_keep %>% 
  select(Rec_ID, pacfin_port_code) %>% 
  distinct() %>%
  ggplot(aes(x=pacfin_port_code)) +
  geom_bar() +
  labs(x="Port",y="Number of Unique Trips",title="Retained Trips Ending <= 50km from a Port")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle=90,vjust=0.5))
retained_trips_plot

trips_removed_plot <- trips_to_remove %>% 
  select(Rec_ID, pacfin_port_code) %>% 
  distinct() %>%
  ggplot(aes(x=pacfin_port_code)) +
  geom_bar() +
  labs(x="Port",y="Number of Trips Removed",title="Removed Trips Ending > 50km from a Port")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle=90,vjust=0.5))
trips_removed_plot

Sort the VMS and fish ticket data by the output of this filtering exercise.

Default behavior for calculations that return NA is to keep the record.

vms_wfilters <- vms %<>% 
  left_join(trips_keep_remove_portdist %>% select(Rec_ID,keep_remove_portdist),by='Rec_ID') %>%
  distinct(VMS_RECNO,UTCDATETIME,.keep_all = T) %>% 
  mutate(keep_remove_portdist=replace_na(keep_remove_portdist,'keep'))

Calculate Speeds

We calculate speeds by lagging observations (grouped by vessel) by 1 time step, then calculating distance and dividing by time. Note, then, that these are looking backwards, not forwards. In other words, the distance and time duration for a given VMS ping are the values associated with the segment ending at that ping.

vms_speeds <- vms_wfilters %>% 
  ungroup() %>% 
  group_by(drvid) %>% 
  # lag latitude and longitude by 1 time step
  mutate(laglon=lag(LON,1,order_by=westcoastdate),laglat=lag(LAT,1,order_by=westcoastdate)) %>% 
  # lag time by 1 time step
  mutate(lagdate=lag(westcoastdate,1,order_by=westcoastdate)) %>% 
  # calculate duration since last ping, in seconds
  mutate(segment_dur=as.duration(lagdate %--% westcoastdate)/dseconds()) %>% 
  ungroup()

# Calculate distance (Note: geosphere seems much faster than doing this with sf())
segment_dists <- geosphere::distHaversine(p1=cbind(vms_speeds$LON, vms_speeds$LAT),
                                    p2=cbind(vms_speeds$laglon, vms_speeds$laglat))

vms_speeds %<>% 
  mutate(segment_dist=segment_dists)

# Speed is just segment distance (default in meters) divided by segment duration (in seconds)
vms_speeds %<>%
  mutate(avg_speed_recalc=segment_dist/segment_dur) %>% 
  # some calculations will be NaN or Inf because of 0 distance or time. Fix these as zeroes
  mutate(avg_speed_recalc=ifelse(segment_dist==0|segment_dur==0,0,avg_speed_recalc))

Compare Reported and Calculated Speeds

Using a subsample of 100 thousand records here.

set.seed(0401)
vms_subsample <- vms_speeds %>% 
  sample_n(100000) %>% 
  mutate(speed_diff=avg_speed_recalc-AVG_SPEED)
vms_subsample %>% 
  ggplot(aes(AVG_SPEED,avg_speed_recalc))+
  geom_point()+
  geom_smooth(method='lm')+
  coord_equal(xlim=c(0,500),ylim=c(0,500))+
  labs(x='Original Average Speed',y="Average Speed Recalculated")

Zoom in a bit more

vms_subsample %>% 
  ggplot(aes(AVG_SPEED,avg_speed_recalc))+
  geom_point()+
  geom_smooth(method='lm')+
  coord_equal(xlim=c(0,50),ylim=c(0,50))+
  labs(x='Original Average Speed',y="Average Speed Recalculated")

Distribution of differences

vms_subsample %>% 
  mutate(speed_diff=avg_speed_recalc-AVG_SPEED) %>% 
  ggplot(aes(speed_diff))+
  geom_density(fill='seagreen',col=NA)+
  geom_vline(xintercept=0,linetype=2)+
  labs(x="Calculated minus Reported Speed (m/s)",y='kernel density')+
  xlim(-5,5)

80.75 percent of our re-calculations are within 1 m/s of the reported average speeds. Beyond that, the reported average speeds tend to be greater than our calculated speeds.
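The within-1-m/s share quoted above can be computed from the subsample; a sketch:

```r
# Sketch: percent of recalculated speeds within 1 m/s of reported speeds
vms_subsample %>%
  summarise(pct_within_1ms = mean(abs(speed_diff) <= 1, na.rm = TRUE) * 100)
```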

Sort Trips to Keep or Remove

max_speed <- 20 # meters per second

This time, indicate whether our recalculated speed is greater than 20 m/s. Add these indicators to the master, matched dataset. Default behavior when calculation returns NA is to keep the record.

vms_wfilters <- vms_speeds %>% mutate(keep_remove_speed=ifelse(avg_speed_recalc>max_speed,'remove','keep')) %>% 
  mutate(keep_remove_speed=replace_na(keep_remove_speed,"keep"))

If we filter using the speed criterion, we would remove 0.02 percent of all observations for 2010.
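The removal share quoted above can be computed from the flag just added; a sketch (an analogous one-liner on the keep_remove_bathy flag reproduces the on-land share reported in the next section):

```r
# Sketch: percent of VMS records flagged for excessive speed
vms_wfilters %>%
  summarise(pct_speeding = mean(keep_remove_speed == 'remove') * 100)
```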


Indicate Pings on Land

Add another keep_remove indicator showing whether a given VMS ping may be on land.

vms_wfilters %<>% mutate(keep_remove_bathy=ifelse(NGDC_M<=0,"keep","remove"))

If we filter using the on-land criterion, we would remove 20.25 percent of all observations for 2010.


Remove In-Port VMS Records

In this piece, we flag all “in-port” VMS records, using a buffer zone of 1.5 km or 3 km, depending on the port. In-port records that also have an average speed of < 1 m/s are marked for removal. Only records marked for removal that occur before or after the fishing trip itself are actually removed from the filtered output data.

## width of buffer circle (in meters). applies to all ports but those that require smaller buffers
r = 3000

## port codes for any ports that require reduced buffer sizes. Default is only COS
ports_lowbuffer <- c("COS")

## width of buffer circle (in meters) for reduced buffer size
r2 = 1500

## cutoff value for speed (m/s) -- 1m/s ~ 2 knots
speed_cutoff <- 1

Using the port coordinates from above, indicate whether each port is in the “smaller buffer” category

portlist_coords %<>%
  mutate(small_buffer=ifelse(port_code %in% ports_lowbuffer,1,0) %>% as.logical())

Buffer Ports

# convert ports to sf() object and buffer
ports_sf <- portlist_coords %>% 
  st_as_sf(coords=c('portlon','portlat'),crs=4326) %>% 
  #project to UTM zone 10
  st_transform(crs = "+proj=utm +north +zone=10 +ellps=WGS84")

# buffer large ports
large_ports_buffer <- ports_sf %>%
  filter(!small_buffer) %>% 
  st_buffer(r)

# buffer small ports
small_ports_buffer <- ports_sf %>% 
  filter(small_buffer) %>% 
  st_buffer(r2)

ports_buffer <- rbind(large_ports_buffer,small_ports_buffer)

Plot Port Buffers

# import a background/land layer from rnaturalearth package
coaststates <- ne_states(country='United States of America',returnclass = 'sf') %>% 
  filter(name %in% c('California','Oregon','Washington')) %>% 
  
  # make sure CRS is the same as the port layer
  st_transform(st_crs(ports_buffer))

# sections of coastline
cutcoast <- list(c(-124, -125, 47.5, 48.5), #upper WA
                 c(-123.5, -124.5, 46.7, 47.7), #lower WA p1
                 c(-123.5, -124.5, 45.7, 46.7), #lower WA / upper OR [WLB, NHL]
                 c(-123.5, -124.5,45.6, 44.6), # OR [TLL, NEW]
                 c(-123.5, -124.5,44.5, 43.5), # OR [WLD, WIN]
                 c(-124, -125, 43.5, 42.25), # OR [COS,GLD]
                 c(-123.75, -124.75, 42.2, 40.5), #OR to CA [BRK, FLN]
                 c(-123, -124, 40, 38.5), # CA [BRG, ARE]
                 c(-122, -123.5, 38.5, 36.6), # CA [BDG, CRZ]
                 c(-120.5, -122, 37, 35), # CA [MOS, AVL]
                 c(-117, -120, 35, 32.5)) # CA[SB, OCN]

# Plot
buffer_plots <- purrr::map(1:length(cutcoast), function(i){
  seg=cutcoast[[i]]
  bx=c(xmin=seg[2],xmax=seg[1],ymin=seg[3],ymax=seg[4])%>% st_bbox(crs=st_crs(4326))
  bbox <- bx %>% st_as_sfc() %>% st_transform(st_crs(ports_buffer)) %>% st_bbox()
  plotout <- ggplot()+
    geom_sf(data=coaststates,fill='gray50')+
    geom_sf(data=ports_buffer,aes(fill=small_buffer),alpha=0.5)+
    geom_sf_text(data=ports_buffer,aes(label=port_code),size=3,hjust=1)+
    xlim(bbox[1],bbox[3])+ylim(bbox[2],bbox[4])+
    labs(x='',y='',title="Port Buffers",fill="Small Buffer")+
    theme(axis.text.x = element_text(angle=90))
  print(plotout)
})

Do Spatial Overlay

Calculate whether points overlap buffers

vms_sf <- vms_wfilters %>% 
  select(VMS_RECNO,X_COORD,Y_COORD) %>%
  st_as_sf(coords=c("X_COORD","Y_COORD"),crs = "+proj=utm +north +zone=10 +ellps=WGS84")

# do the overlay, which results in a list of buffer indices into which a VMS point falls
#https://github.com/r-spatial/sf/wiki/migrating
port_names <- ports_buffer$port_code
# vms_ports_over_list <- st_intersects(vms_sf,ports_buffer)
vms_ports_over <- sapply(st_intersects(vms_sf,ports_buffer), function(z) if (length(z)==0) NA_integer_ else z[1])

# Overlay returned integers. Pull out the appropriate port names from the list of ports
vms_ports_over_names <- port_names[vms_ports_over]
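This step leans on R's indexing semantics: subscripting a character vector with `NA_integer_` returns `NA` for that position rather than erroring, so points that fall in no buffer stay `NA`. A minimal sketch with made-up port codes:

```r
# Hypothetical port codes and overlay results (NA = point fell in no buffer)
port_names_demo <- c("BDG", "CRZ", "MOS")
overlay_idx <- c(2L, NA_integer_, 1L)

# Indexing with NA yields NA for that position
port_names_demo[overlay_idx]
# [1] "CRZ" NA    "BDG"
```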

Add the port names back to the master dataset for each VMS point.

vms_wfilters %<>%
  mutate(in_port=vms_ports_over_names)

Check Output

## Plot a sample of VMS points
vms_sf %<>%
  mutate(in_port=vms_ports_over_names)

test_port<-vms_sf %>% 
  filter(in_port=='BDG') %>% 
  #subsample for quicker plotting
  sample_n(.,min(nrow(.),1000))
testbbox <- st_bbox(ports_buffer %>% filter(port_code=='BDG'))
# testbbox <- st_bbox(test_port)
ggplot()+
  geom_sf(data=coaststates,fill='gray50')+
  geom_sf(data=ports_buffer,fill='seagreen',alpha=0.5)+
  geom_sf(data=test_port,size=1)+
  xlim(testbbox[1],testbbox[3])+ylim(testbbox[2],testbbox[4])+
  labs(x='',y='',title="Points within Bodega Buffer")+
  theme(axis.text.x = element_text(angle=90))

What is the average speed when vessels are marked as in port?

inport_dat <- vms_wfilters %>% mutate(in_port_binary = ifelse(is.na(in_port), FALSE,TRUE))

ggplot(data=filter(inport_dat, avg_speed_recalc < 50), aes(x=avg_speed_recalc,fill=factor(in_port_binary))) +
  geom_histogram() +
  facet_wrap(~in_port_binary)+
  labs(x="Average Speed (Calculated)",y="Number of Records",fill="In Port")

Sort In-Port Segments to Keep or Remove

Mark a record “remove” if it is in port and its average speed is below the cutoff (1 m/s). For this filter, we use our recalculated speeds from the previous step.

vms_wfilters %<>% 
  mutate(keep_remove_inport = ifelse(!is.na(in_port) & avg_speed_recalc < speed_cutoff, "remove", "keep"))

round(sum(vms_wfilters$keep_remove_inport=='remove',na.rm=T)/nrow(vms_wfilters)*100,2)
## [1] 65.77

If we filter using the in-port criterion, we would remove 65.77 percent of all observations for 2010.

We want to remove all ‘remove’ points except the last in-port record before the fishing trip and the first in-port record after the fishing trip. For now, we will retain all mid-trip “remove” points.

We find these records for each trip: we sort by date and by whether the vessel is in port, then flag all of the records that fall before or after the fishing trip. We add a final keep_remove_ column that indicates whether a VMS ping is truly part of a “trip” according to this criterion.

As a default, we retain pings: if a trip does not have a first or last in-port indicator, its pings are marked as in-trip.
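The bracketing logic can be checked on a toy trip. This is a sketch with fabricated timestamps and in-port flags, not pipeline data; it mirrors the dplyr `first()`/`last()` pattern used below:

```r
library(dplyr)

# One fabricated trip: integer stand-ins for timestamps,
# in-port pings at the start and end
toy <- tibble(
  westcoastdate = 1:6,
  keep_remove_inport = c("remove","remove","keep","keep","remove","remove")
)

toy %>%
  # first and last non-in-port records
  mutate(first_keep = first(westcoastdate[keep_remove_inport == "keep"]),
         last_keep  = last(westcoastdate[keep_remove_inport == "keep"])) %>%
  # last pre-trip and first post-trip in-port records
  mutate(last_pretrip   = last(westcoastdate[westcoastdate < first_keep]),
         first_posttrip = first(westcoastdate[westcoastdate > last_keep])) %>%
  mutate(keep_remove_not_intrip = if_else(
    westcoastdate >= last_pretrip & westcoastdate <= first_posttrip,
    "keep", "remove"))
```

Rows 2 through 5 are kept (the at-sea pings plus one bracketing in-port record on each side); rows 1 and 6 are dropped.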

vms_wfilters %<>%
  
  # organize by trip and date
  group_by(Rec_ID) %>% 
  arrange(westcoastdate) %>% 
  
  # indicate the dates of the first and last non-in-port records
  mutate(first_keep=first(westcoastdate[keep_remove_inport=='keep']),
         last_keep=last(westcoastdate[keep_remove_inport=='keep'])) %>% 
  
  # indicate the dates of the last pre-trip in-port record and the first post-trip in-port record
  mutate(last_pretrip=last(westcoastdate[westcoastdate<first_keep]),
         first_posttrip=first(westcoastdate[westcoastdate>last_keep])) %>% 
  
  # add a filtering column indicating whether a ping is part of the trip
  ungroup() %>% 
  mutate(keep_remove_not_intrip=ifelse(westcoastdate >= last_pretrip & westcoastdate <= first_posttrip,'keep','remove')) %>%
  # change NAs to default behavior of keep
  mutate(keep_remove_not_intrip=replace_na(keep_remove_not_intrip,'keep'))

View flagged points

For a random set of 5 trips, view the results of these filters.

set.seed(0401)
random_trips <- vms_wfilters %>% filter(TARGET_rev %in% spp_codes) %>% distinct(Rec_ID) %>% sample_n(5) %>% pull(Rec_ID)
testtrips <- vms_wfilters %>% 
  filter(Rec_ID %in% random_trips) %>% 
  select(Rec_ID,VMS_RECNO,X_COORD,Y_COORD,contains('keep_remove')) %>% 
  select(-keep_remove_inport) %>% 
  #pivot to indicate, if a point was removed, which step it was removed in
  pivot_longer(contains('keep_remove'),names_to = 'category',values_to = 'keep_remove') %>% 
  group_by(VMS_RECNO) %>% 
  # pull out the FIRST reason why a point is flagged for removal (port distance, speed, bathymetry, or in-port)
  mutate('reason_removed'=first(category[keep_remove=='remove'],default='kept')) %>% 
  ungroup() %>% 
  mutate(reason_removed=str_replace(reason_removed,'keep_remove_','')) %>% 
  mutate(reason_removed=as.factor(reason_removed)) %>% 
  select(Rec_ID,VMS_RECNO,X_COORD,Y_COORD,reason_removed) %>% 
  distinct() %>% 
  #convert to spatial
  st_as_sf(coords=c("X_COORD","Y_COORD"),crs = "+proj=utm +north +zone=10 +ellps=WGS84") %>% 
  # jitter a tiny bit
  st_jitter()

#plot
testtrip_plots <- purrr::map(random_trips, function(x){
  testtrip <- testtrips %>% filter(Rec_ID==x)
  bbox <- st_bbox(testtrip)
  plotout <- ggplot()+ 
    geom_sf(data=coaststates,fill='gray50')+
    geom_sf(data=ports_buffer,fill='purple',alpha=0.2)+
    geom_sf(data=testtrip,aes(col=reason_removed),key_glyph='point',alpha=1)+
    xlim(bbox[1],bbox[3])+ylim(bbox[2],bbox[4])+
    labs(x="",y="",color="Reason\nRemoved")+
    scale_color_discrete(drop=FALSE)+
    theme(axis.text.x = element_text(angle=90)) 
  print(plotout)
})

Check Effects of Filters

We now have a list of all VMS points with indicators of whether they are/will be filtered out because of the four reasons described above (port distance of last ping, ludicrous speeds, positive bathymetry, and/or between-trip pings within ports).

We can organize these reasons and look at the overall effects of filtering.

# on the individual ping level, measure reasons for removal
vms_dataloss <- vms_wfilters %>%
  select(Rec_ID,VMS_RECNO,X_COORD,Y_COORD,contains('keep_remove')) %>% 
  select(-keep_remove_inport) %>% 
  distinct() %>% 
  #pivot to indicate, if a point was removed, which step it was removed in
  pivot_longer(contains('keep_remove'),names_to = 'category',values_to = 'keep_remove') %>% 
  group_by(VMS_RECNO) %>% 
  # pull out the FIRST reason why a point is flagged for removal (port distance, speed, bathymetry, or in-port)
  mutate('reason_removed'=first(category[keep_remove=='remove'],default='kept'),
         num_removals=sum(keep_remove=='remove',na.rm=T)) %>% 
  ungroup() %>% 
  mutate(reason_removed=str_replace(reason_removed,'keep_remove_','')) %>% 
  select(Rec_ID,VMS_RECNO,X_COORD,Y_COORD,reason_removed,num_removals) %>% 
  distinct()

vms_dataloss_table <- vms_dataloss %>% 
  mutate(total_records=n()) %>% 
  group_by(reason_removed,total_records) %>% 
  summarise(n_removed=n()) %>% 
  ungroup() %>% 
  filter(reason_removed!='kept') %>% 
  mutate(stepnum=case_when(
    reason_removed=="portdist" ~ 1,
    reason_removed=="speed" ~2,
    reason_removed=="bathy"~3,
    reason_removed=="not_intrip"~4
    )) %>% 
  arrange(stepnum) %>% 
  mutate(tot_removed=cumsum(n_removed)) %>% 
  mutate(n_left=total_records-tot_removed) %>% 
  mutate(prop_remain=1-tot_removed/total_records) %>% 
  add_row(reason_removed='start',stepnum=0,prop_remain=1)

vms_dataloss_table %>% 
  ggplot(aes(stepnum,prop_remain))+
  geom_point()+
  geom_line()+
  scale_y_continuous(limits=c(0,1))+
  labs(x="Filtering Step",y="Proportion Remaining Observations")+
  theme(panel.grid.major = element_line())

vms_dataloss_table %>% 
  ggplot(aes(reorder(reason_removed,stepnum),prop_remain))+
  geom_col(width=0.5)+
  scale_y_continuous(limits=c(0,1))+
  labs(x="Filtering Step",y="Proportion Remaining Observations")+
  theme(panel.grid.major = element_line())

Have we removed > 99% of records from any trips?

check_removes <- vms_dataloss %>% 
  group_by(Rec_ID) %>% 
  mutate(remove=num_removals>0) %>% 
  summarise(prop_removed=sum(remove)/n()) %>% 
  ungroup()

893 trips were removed entirely, and 898 had more than 99 percent of their records removed, out of 23773 total trips.
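The counts quoted above can be derived from the per-trip proportions in check_removes. A sketch using a fabricated stand-in table (the thresholds, not the numbers, are the point):

```r
# Fabricated stand-in for check_removes: one row per trip
check_removes_demo <- data.frame(
  Rec_ID = 1:4,
  prop_removed = c(1, 0.995, 0.5, 0)
)

sum(check_removes_demo$prop_removed == 1)    # trips removed entirely
sum(check_removes_demo$prop_removed > 0.99)  # trips with >99% of records removed
nrow(check_removes_demo)                     # total trips
```

Note that the >99% count includes the fully removed trips, as it does in the figures reported above.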

Do Filter and Save Results

Finally, we do the actual filtering. We save a filtered version of the data for the next steps, but retain the full VMS-matched dataset with the filtering tags we’ve produced here, for later QA/QC.

vms_filtered <- vms_dataloss %>% 
  select(VMS_RECNO,num_removals) %>% 
  right_join(vms_wfilters,by=join_by(VMS_RECNO)) %>% 
  mutate(remove=num_removals>0) %>% 
  #final filter
  filter(!remove) %>% 
  distinct(Rec_ID,UTCDATETIME,.keep_all = T) %>% 
  
  # clean up, removing filtering columns
  select(VMS_RECNO:multispecies,segment_dur,avg_speed_recalc,in_port) %>% 
  select(-num_removals)

Save the output and the full, non-filtered data set

#Save version with FTID and length
write_rds(vms_filtered,here::here('Confidential','processed','matched','filtering',paste0(process_year,'_matched_filtered_withFTID_length.rds')))
write_rds(vms_wfilters,here::here('Confidential','processed','matched','filtering',paste0(process_year,'_matched_unfiltered.rds')))
# distribution of segment duration? in minutes
vms_filtered %>% ggplot(aes(segment_dur))+geom_density()+xlim(0,100)

x<-proc.time()-alltime

So far, this pipeline for 2010 VMS data has taken 14.06 minutes to run.
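The minutes figure comes from the proc.time() difference above, whose `elapsed` element is wall-clock seconds. A sketch of the conversion, using an illustrative value of 843.6 seconds:

```r
# Wall-clock seconds from proc.time() - alltime; 843.6 s is illustrative
elapsed_secs <- 843.6
round(elapsed_secs / 60, 2)
# [1] 14.06
```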